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Summary 

Azzalini & Dalla Valle (1996) have recently discussed the multivariate skew-normal distri- 
bution which extends the class of normal distributions by the addition of a shape parameter. 
The first part of the present paper examines further probabilistic properties of the distribu- 
tion, with special emphasis on aspects of statistical relevance. Inferential and other statistical 
issues are discussed in the following part, with applications to some multivariate statistics 
problems, illustrated by numerical examples. Finally, a further extension is described which 
introduces a skewing factor of an elliptical density. 
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1 Introduction 



There is a general tendency in the statistical literature towards more flexible methods, to 
represent features of the data as adequately as possible and reduce unrealistic assumptions. 
For the treatment of continuous multivariate observations within a parametric approach, one 
aspect which has been little affected by the above process is the overwhelming role played 
by the assumption of normality which underlies most methods for multivariate analysis. A 
major reason for this state of affairs is certainly the unrivaled mathematical tractability of the 
multivariate normal distribution, in particular its simplicity when dealing with fundamental 
operations like linear combinations, marginalization and conditioning, and indeed its closure 
under these operations. 

From a practical viewpoint, the most commonly adopted approach is transformation of the 
variables to achieve multivariate normality, and in a number of cases this works satisfactorily. 
There are however also problems: (i) the transformations are usually on each component 
separately, and achievement of joint normality is only hoped for; (ii) the transformed vari- 
ables are more difficult to deal with as for interpretation, especially when each variable is 
transformed using a different function; (hi) when multivariate homoscedasticity is required, 
this often requires a different transformation from the one for normality. 

Alternatively, there exist several other parametric classes of multivariate distributions to 
choose from, although the choice is not as wide as in univariate case; many of them are 
reviewed by Johnson & Kotz (1972). A special mention is due to the hyperbolic distribution 
and its generalized version, which form a very flexible and mathematically fairly tractable 
parametric class; see Barndorff-Nielsen & Blsesild (1983) for a summary account, and Blaesild 
(1981) for a detailed treatment of the bivariate case and a numerical example. 

As for extensions of distribution theory of classical statistical methods, the direction which 
seems to have been explored more systematically in this context is the extension of distribu- 
tion theory of traditional sample statistics to the case of elliptical distribution of the un- 
derlying population; elliptical distributions represent a natural extension of the concept of 
symmetry to the multivariate setting. The main results in this area are summarized by Fang, 
Kotz & Ng (1990); see also Muirhead (1982, chapters 1 and 8). 

Except for data transformation, however, no alternative method to the multivariate nor- 
mal distribution has been adopted for regular use in applied work, within the framework 
considered here of a parametric approach to handle continuous multivariate data. 

The present paper examines a different direction of the above broad problem, namely 
the possibility to extend some of the classical methods to the class of multivariate skew- 
normal distributions which has recently been discussed by Azzalini & Dalla Valle (1996). 
This distribution represents a mathematically tractable extension of the multivariate normal 
density with the addition of a parameter to regulate skewness. 

We aim at demostrating that this distribution achieves a reasonable flexibility in real data 
fitting, while it maintains a number of convenient formal properties of the normal one. In 
particular, associated distribution theory of linear and quadratic forms remains largely valid. 

More specifically, the targets of the paper are as follows: (a) to extend the analysis of the 
probabilistic aspects of the multivariate skew-normal distribution, especially when they repro- 
duce or resemble similar properties of the normal distribution; (b) to examine the potential 
applications of this distribution in statistics, with special emphasis on multivariate analysis. 
Correspondingly, after a summary of known results about the distribution, sections [3j 14| and [5J 
deal with distribution of linear and quadratic forms of skew-normal variates, and other prob- 
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abilistic aspects; sections [6] and [7] deal with issues of more direct statistical relevance, with 
some numerical examples for illustration. In addition, section [8] sketches an additional level 
of generalization by introducing a skew variant of elliptical densities. 



2 The multivariate skew-normal distribution 

We first recall the definition and a few key properties of the distribution, as given by Azzalini & 
Dalla Valle (1996) except for re-arrangement of the results. A A;-dimensional random variable 
Z is said to have a multivariate skew-normal distribution if it is continuous with density 
function 

24( 2 ;0)$(a T z), (z 6 (1) 

where 4>k{z;£l) is the fc-dimensional normal density with zero mean and correlation matrix 
Q, $(•) is the N(0, 1) distribution function, and a is a /e-dimensional vector. For simplicity, VL 
is assumed to be of full rank. 

When a = 0, ([]]) reduces to the A^O, f2) density. We then refer to a as a 'shape parameter', 
in a broad sense, although the actual shape is regulated in a more complex way, as it will 
emerge in the course of the paper. 

The above density does not allow location and scale parameters. Clearly, these are essen- 
tial in practical statistical work, but we defer their introduction until later, to keep notation 
simple as long as possible. 

The matrix Q and the vector a appearing in ([]]) were defined in Azzalini & Dalla Valle 
(1996) as functions of other quantities, namely another correlation matrix ^ and a vector 
A G R k ; hence a member of the parametric family was identified by the pair (A, \&). It is in 
fact possible to identify the member of the family directly by the pair (a, 12); i.e. this pair 
provides an equivalent parametrization of the class of densities. The proof of this fact is of 
purely algebraic nature, and it is given in an appendix, together with some related results. 
For the purposes of the present paper, this parametrization appears preferable and we shall 
adopt the notation 

Z ~ SN k (n,a) 

to indicate that Z has density function (Q]) . 
The cumulant generating function is 

K(t) = log M(t) = ±t T Qt + log{2 <$>{5 T t)} (2) 

where 

5 = m tta. (3) 

Hence the mean vector and the variance matrix are 

fi z = E{Z} = (2/tt) 1/2 S, var{Z} = Q - ^ /xj. (4) 

The following result provides a stochastic representation of Z, useful for computer gener- 
ation of random numbers and for theoretical purposes. 



Proposition 1 Suppose that 



X$\ , t ,„ ^ N ( 1 5 T 



x) -^ + i(o,n*), n< 
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where X is a scalar component and f2* is a correlation matrix. Then 

Z -- 

is SN fc (fi, a) where 



X ifX > 
-X otherwise 



a = 



(i - 5 T n- 1 s) 1/2 



n _1 <y. (5) 



Also, we shall make repeated use of the Sherman-Morrison-Woodbury formula for matrix 
inversion, which states 

{A + UBV)- 1 = A' 1 - A~ 1 UB{B + BVA^UB^BVA- 1 (6) 

for any conformable matrices, provided the inverses involved exist; see for instance Rao 
(1973, exercise 2.9, p. 33). 

3 Linear and quadratic forms 

A key feature of the multivariate normal distribution is its simplicity to handle linear and 
quadratics forms. We now explore the behaviour of the skew-normal distribution in these 
cases. 

3.1 Marginal distributions 

It is implicit in the genesis of the multivariate skew-normal variate, as described by Azzalini 
& Dalla Valle (1996), that the marginal distribution of a subset of the components of Z is still 
a skew-normal variate. In the marginalization operation, the (A, parametrization works in 
a very simple manner, since one only needs to extract the relevant components of A and 
With the (f2, a) parametrization, specific formulae must be developed. 

Proposition 2 Suppose that Z ~ SN fc (S7, a) and Z is partitioned as Z T = (Zj , Zj) of dimen- 
sions h and k — h, respectively; denote by 

the corresponding partitions ofQ and a. Then the marginal distribution of Z\ is SN h (J7n, a±), 
where 

_ ai + n^n 12 a 2 o -O O O^O 

OL\ — — T-r \T7o ' "22-1 — "22 — l'2li'ii s '12- 

(1 + a 2 ' S222-10J2) i/Z 

The proof follows from straightforward integration, with the aid of Proposition 4 of Azzalini 
& Dalla Valle (1996). 
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3.2 Linear transforms 

Proposition 3 If Z ~ SN&(fi, a), and A is a non-singular k x k matrix such that A T QA is a 
correlation matrix, then 

A T Z ~ SN k (A r nA,A- 1 a). 

The proof follows from standard rule of transformation of random variables. The above 
condition that A T VLA is a correlation matrix is there for the sake of simplicity of exposition, 
and it can be removed; see section [5j 

Proposition 4 For a variable Z ~ SNfc(fi 3 a), there exists a linear transform Z* = A*Z such 
that Z* ~ SNfc(ifc, a*) where at most one component of a* is not zero. 

Proof. By using the factorization 17 = C T C, we first transform Z into a variable Y = (C T )~ 1 Z 
such that Y ~ SNfc(ifc, Ca). Now consider an orthogonal matrix P with one column on the 
same direction of Ca, and define Z* = P T Y which fulfills the conditions. 

The above result essentially defines a sort of 'canonical form' whose components are mu- 
tually independent, with a single component 'absorbing' all asymmetry of the multivariate 
distribution. This linear transformation plays a role similar to the one which converts a 
multivariate normal variable into a spherical form. Further, notice that the component trans- 
formations of A* are invertible; hence it is possible to span the whole class SNfe(J7, a) starting 
from Z* and applying suitable linear transformations. The density of Z* is of the form 

k 

2 n>(u;) $(c4u m ) 
i=l 

where 

a T Oa (7) 



is the only non-zero component of a*. 

For the rest of this section, we examine conditions for independence among blocks of 
components of a linear transform Y = A T Z. Before stating the main conclusion, we need the 
following intermediate result. 



Proposition 5 Let Z ~ SNfc(f2, a) and A is as in Proposition^ and consider the linear trans- 
form 

r - A ™-GNtY 

where the matrices A\, . . . , A^ have mi, ... , rrih columns, respectively. Then 

Yi ~ SN m< (fiy^ayj 

where 

O -A^OA n (AjQAj)~ l AjQa 

U Yi — A i ilAi, ay t — 777 

(1 + a T {n - flMAjnA^Ajtyay 12 



5 



Proof. Without a loss of generality, we consider the case h = 2 and i = 1. Write A = (A%, A2) 
and denote its inverse by 

where the number of columns of the blocks of A matches the number of rows of the blocks 
of A~ x . Since AA^ 1 = I k , then the identity A 1 A ( f 1 ' > + A 2 A < f 1 ^ = I k holds. On partitioning 
A T VlA in an obvious way, and 

UK* 

the result follows after some algebra by applying Proposition [2] to the parameters of A T Z, 
taking into account the above identity. 

We now turn to examine the issue of independence among blocks of a linear transform 
A T Z where A satisfies the condition of Proposition [3j To establish independence among the 
Yi's, a key role is played by the <£(•) component in ([]]). Since &(u + v) cannot be factorized 
as the product <3?(u) $(v), it follows that at most one of the Y{ can be a 'proper' skew-normal 
variate, while the others must have the skewness parameter equal to 0, hence be regular 
normal variates, if mutual independence holds. 

Proposition 6 If Z ~ SNfc(fi,a), and A T QA is a positive definite correlation matrix, then the 
variables (¥]_,..., !/,) defined by (HJ) are independent if and only if the following conditions hold 
simultaneously: 

(a) AjnAj = Ofori^j, 

(b) AjVLa 7^ 0/or at most one i. 

Proof. Prove sufficiency first. By Proposition [3] and condition (a), the joint distribution of Y 

is SNfc(Oy, ay) where 

Q Y = diag(AjnA 1 ,...,AlflA h ), 

/(AjQA^AjQa- 

ay = (A T nA)- 1 A T n a = I : 

{(AlQAh^AlQa, 

If condition (b) is satisfied too, only one of the blocks of ay is not zero. Hence the joint 
density can be factorized in obvious manner. 

To prove necessity, note that if independence holds the density of Y can be factorized as 
the product of the densities of the Yi's, given by Proposition [5j Since the function cannot 
be factorized, only one block of ay can be not zero, and Uy must be a block-diagonal matrix. 
These requirements can be met only if conditions (a) and (b) are satisfied. 

Notice that the parameters of the Y/s are equal to the corresponding blocks of (p.y,a Y ) 
only if independence holds. 
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3.3 Quadratic forms 



One appealing feature of the one-dimensional skew-normal distribution is that the square of 
a random variate of this kind is a x\- This property carries on in the multivariate case since 
Z T £1~ 1 Z ~ xt> irrespectively of a. These facts are special cases of the more general results 
presented below. 

Proposition 7 If Z ~ SNfc(f2, a), and B is a symmetric positive semi-definite k x k matrix of 
rank p such that BflB = B, then Z 1 BZ ~ \%- 

Proof. Consider first the case of a random variable Y ~ SN P (J P , a). Since Y T Y = Y T AA T Y 
for any orthogonal matrix A, hence in particular it holds for a matrix having a column on 
the same direction of a, i.e. we are considering the canonical form associated to Y. It then 
follows that Y T Y ~ Xp independently of a. 

In the general case, let us write B = MM T where M is a full-rank k x p matrix (p < k), 
and notice that M T VLM = I p is equivalent to B£IB = B; to see this, it is sufficient to left- 
multiply each side of the latter equality by(M T M)~ l M T and right-multiply by its transpose. 
Then Z T BZ = Y T Y where Y = M T Z ~ SN P (I P , ay) for some suitable vector ay. Therefore 
the statement holds because Y T Y ~ Xp- 

Corollary 8 If Z ~ SN fc (f2, a), and C is a full-rank k x p matrix (p < k), then 

z T c(c T ncy 1 c T z ~ Xp. 

Proposition 9 If Z ~ SNfc(r2, a), and is a symmetric positive semi-definite k x k matrix of 
rank pi (i = 1, 2, . . . , h) such that 

(a) BiflBj = 0/or i + j, 

(b) a T VtBiQ,a ^ 0/or at most one i, 

then the quadratic forms Z T BiZ (i = 1, 2, . . . , h) are mutually independent. 

Proof. Similarly to the proof of Proposition [TJ write Bi = MiMj where Mj has rank pj. 
Clearly the quadratic forms Z 1 B{Z are mutually independent if this is true for the linear 
forms Mj Z. It is easy to see that MjVtMj = is equivalent to BjVlBj = for i ^ j; 
similarly MjQ,a / is equivalent to a^fil^Oa / 0. This completes the proof. 

Proposition 10 (Fisher-Cochran) If Z ~ SNjfc(Ifc,a) and B\,...,Bh are symmetric k x k 
matrices of rank pi, . . . ,ph, respectively, such that Y^Bi = 1^ and B^a / for at most one 
choice of i, then the quadratic forms Z T BiZ are independent x 2 Pi if and only ifY^Pi = k. 

Proof. The proof follows the steps of the usual one of Fisher-Cochran theorem, as given for 
instance by Rao (1973, p. 185 ff.), taking into account Proposition [9] for independence of the 
quadratic forms, and Proposition [7] as for their marginal distributions. 

It would be possible to develop this section via a different approach, on the basis of 
Proposition [TJ For most of the results, this route would offer a simple treatment, but for 
some others it would be quite cumbersome, especially for the results about independence of 
components. 
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4 Cumulants and indices 

To study higher order cumulants besides those given in Section El we need some preliminary 
results about the cumulants of the half-normal distribution, i.e. the distribution of V = \U\, 
where U ~ iV(0, 1). Its cumulant generating function is 

K v (t) = \t 2 + Q (t) 

where 

Co(x) = log(2$(x)). 

For later use, define 

d m 

Cmi x ) = j^Co(aj) (m = l,2,...). 

Clearly, Clip) = 4>{ x )/^{ x )\ the subsequent derivatives can be expressed as functions of the 
lower order derivatives, e.g. 

O2O*) = -Ci(x){x + &(x)}, 

(six) = -C 2 (x){x + Ci(x)}-Ci(x){l + C 2 (x)}, 

d(x) = -C 3 (x){x + 2Ci(x)}-2C 2 (x){l + C 2 (x)}, 

hence as functions of Ci( x )- Computation of Cm, at x = gives the corresponding cumulant 
k^. Unfortunately, it is not clear how to obtain a closed or recursive formula for the £ m (x)'s. 
An alternative route for computing 4n is as follows: since V ~ (x?) 1 ^ 2 then 

E{v m } = =-=rT f 1 



A V 2 

which admits the recurrence formula 

E{V m } = (m- l)E{V m ~ 2 } , (m>2). 

Hence the cumulant can be obtained from the set E{y r } , r = 1, . . . , m, using well-known 
results; see e.g. Table 2.1.2 of David, Kendall & Burton (1966) for expressions connecting 
cumulants to moments up to order 8. In particular, we obtain for V that 

4 = (2/vr) 1 / 2 (4/tt - 1), k\ = 4(2 - 6/^/vr. 
Returning to cumulant generating function ([2]), its first two derivatives are 

— ^ = Qi + Ci(x)<5, = tt + C 2 (x)M T 

where x = 5 T t, and its evaluation at t = confirms ©. Higher order cumulants are obtained 
from 

d m K(t) 

dt i dt j ...dt r =Ux)SiSj --- dr 
which needs to be evaluated only at x = where 

Cm( x )\ x= o = K m 
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which can has been obtained as described above. 

One use of these expressions is to obtain summary indicators for the SN& distribution. 
The most popular ones are those introduced by Mardia (1970, 1974) to measure multivariate 
skewness and kurtosis. In our case, the index of skewness takes the form 

7i, k = 0i,k = {^X) 2 /2 ^2 ^r5 s 6 t 8 r >5 s >5t'(y rr <y ss & tt 

rst r's't' 

where S = $7 — fi z ^J = (a rs ) with inverse X -1 = (a rs ). Similarly, the index of kurtosis is 

72,fe = 02,k-k(k + 2)=4J2 5 r S ^u(T rs a tu 

rstu 
-1, ^ 



= 2(tt-3)(^E-V 

There exists an alternative multivariate index of skewness discussed in the literature; 
see e.g. McCullagh (1987, p.40). However this differs from 71^ only by a different way of 
matching the indices of the cumulants, but this has no effect in the present case because of 
the special pattern of the cumulants of order higher than 2. Hence, in our case the two indices 
of skewness coincide. 

Using ©, one can re-write 

Tv-l, mJ°~V* 



1 - /J.JQ, V2 



which allows easier examination of the range of fijT, 1 fi z , by considering the range of 
5 T £l~ 1 5. On using ([3]), we write 



1 + a T fla 1 + a 
where a is the square of a* m , defined by ©. Since a spans [0, 00), then 

/ijE" V* = J a 9 , e [0, 2/(tt - 2)) 
7T + [tt — 2)a 

and the approximate maximal values for 71^ and 72^ are 0.9905, and 0.869, respectively, in 
agreement with the univariate case. Since both 71^ and j 2 ,k depend of a) only via a* m , 
this reinforces the role of the latter as the summary quantity of the distribution shape. 



5 Some extensions 

5.1 Location and scale parameters 

For the subsequent development of the paper, we need to introduce location and scale param- 
eters, which have been omitted in the expression (Q]) of the density of Z. Write then 

Y = £ + ujZ (9) 
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where 

£ = • • • ,£fc) T , <*> = diag(wi, . . . ,w fc ) 

are location and scale parameters, respectively; the components of u are assumed to be 
positive. The density function of Y is 

2cj )k {y-^)^{a Y uj- 1 {y-i)} (10) 

where 

= U!ft z UJ 

is a covariance matrix and, from now on, Q z replaces the symbol used in the previous 
sections. Hence, for instance, © must now be read with replaced by VL Z . We shall use the 
notation 

Y ~SN fc (£,0,a) 

to indicate that Y has density function (fTOl) . In the sequel, we shall also use the notation 
yA to denote the diagonal matrix of the square root of the diagonal elements of a positive 
definite matrix A; hence, for instance, uj = a/O- 

Earlier results on linear and quadratic forms for Z carry on for Y, apart for some slight 
complication in the notation. For instance, for a linear transform A T Y where A is a k x h 
matrix, a simple extension of Proposition [5] gives 

X = A T Y ~SN h (U,n x ,a x ) (ID 

where 

ix = A T £, n x = A T QA, a a 



\ \l/2 



(l + a T (Q z - Btt^B' 
and 

uj x = \/Qx, B = u; _1 OA 

Similar extensions could be given for other results of Section [3] For later reference, we write 
the new form of the cumulant generating function 

K(t) = t T £ + ±t T nt + log{2 <S>(5 T ut)}. (12) 
5.2 Conditional distributions 

Suppose that Y has density function (fTOl) . and it is partitioned in two components, Y\ and 
Y2, of dimensions h and k — h, respectively, with a corresponding partition for £, O and a. To 
examine the distribution of Y2 conditionally on Y\ =y\, write 

f2 =6 + ^21^11 (j/l-fl), 022-1 = 022-0210^012, tti = — — ^ ^7^, 

where 

UJ\ = a/ On, ^2 — \/^22' 2 2-l — ^2 1 022-1^2 ^ 

Here ^ an d O221 are given by the usual formulae for the conditional mean and variance of 
a normal variable, and a.\ is the shape parameter of the marginal distribution of Y%. After 



10 



some straightforward computation, it follows that the cumulant generating function of the 
conditional distribution is 

K c {t) = t T & + ±t T Q 2 2-it + log $(x + 5jto 2 t) - log $(s ) 

where 

x = aju^iyi - 6) 

and 5 2 is computed similarly to ([3]), with and a replaced by Cl 22 .i and a 2 , respectively. This 
gives immediately 

E{F 2 |yi} = & + ClOcok, var{y 2 |yi} = n 22 .! + C 2 (xo)tt t (13) 
where r = ^2^2; higher order cumulants of order m are of the form 

Cm(xo) T r T s ■ ■ ■ Tu , (m > 2), 

m terms 

where r r denotes the r-th component of r. 

Clearly, K c (t) is of form (fT2l) . This special case occurs only if = 0; this condition is 
essentially equivalent to at\ = 0, i.e. Y\ is marginally normal. 

The expression of the conditional density in the general case is easily written down, 
namely 

4>k-h(V2 - &, n 22-l) H<4^2 1 {V2 ~ £D + x'o}/Hxo) (14) 

where x' Q = (l + O22.1 02) ^o- In the case k — h = 1, this distribution has been discussed 
by several people, including Chou & Owen (1984), Azzalini (1985), Cartinhour (1990) and 
Arnold et al. (1993). From (fT4lh it is easy to see that conditions for independence among 
components are the same of the unconditional case, with O22 1 and a 2 replacing f2 and a, 
confirming again the usefulness of the adopted parametrization. 

The shape of ( fT4l ) depends on a number of ingredients; however, for most cases, the plot 
of this density function displays a remarkable similarity with the one of the skew-normal 
density. This similarity suggests the approximation of the conditional density by a skew- 
normal density which matches cumulants up to the third order. 

The resulting equations allow explicit solution, except for extreme situations when the 
exact conditional density has an index of skewness outside the range of the skew-normal one; 
these unfeasible cases are very remote. In the overwhelming majority of cases, the equations 
can be solved, and the approximate density is close to the exact one. Figure Q] shows the 
contour levels of the two densities for two combinations of parameter values when k — h = 2; 
the left panel shows one of the worst cases which have been observed, while the right panel 
displays a much better, and also more frequently observed, situation. 

Besides the generally small numerical discrepancy between the approximate and the exact 
density, the following two properties hold. 

o Independence is respected. If two components of Y 2 are independent conditionally on 
Y\ = yi with respect to the exact conditional density, so they are with respect to the 
approximate one, and vice versa. 

o Interchange of marginalization and conditioning. Integrating out some components of 
Y 2 after conditioning produces the same result of integration followed by conditioning. 
This fact is obvious when using the exact density; it still holds for the approximate one. 
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Conditional multivariate SN pdf Conditional multivariate SN pdf 




8 9 10 11 12 9 10 11 12 13 



x x 

Figure 1: Contour levels of the exact (dashed lines) and approximate (continuous lines) conditional 
density of a multivariate skew-normal variable, plotted for two sets of values of the parameters and of 
the conditioning variable 

To prove the first statement, denote by (a, b) a partition of set of indices composing Yi- Condi- 
tional independence of Y a and Y\> implies that f?22-i is block diagonal and that one of the two 
components, Y a say, has no skewness; hence 5 a = and r a = 0. Therefore all off-diagonal 
blocks composing the variance in (fT3l) are 0, and the same structure must hold in the match- 
ing quantity of the approximating distribution. The converse statement can be proved sim- 
ilarly. To prove the second statement, simply notice that the approximation preserves exact 
cumulants up to the third order, which uniquely identify a member of the SN family; hence 
also the cumulants of the marginal distribution are preserved up to the same order. 

The degree of accuracy of the approximation jointly with the above two properties support 
routine use of the approximate conditional density in place of the exact one. In this sense, 
we can say that the skew-normal class of density is closed with respect to the conditioning 
operation. 

6 Statistical issues in the scalar case 
6.1 Direct parameters 

Starting from this section, we switch attention to inferential aspects, and other issues of more 
direct statistical relevance, initially by considering univariate distributions. 

Some of the issues discussed in this subsection have a close connection with the problem 
considered by Copas & Li (1997) and the sociological literature on Heckman's model refer- 
enced there; see also Aigner et al. (1977) and the literature of stochastic frontier models. 
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Figure 2: Twice relative profile loglikelihood of a (left) and contour levels of the similar function of 
(lo, a) (right) for the Otis data, when the direct parametrization is used 

In the univariate case, write Y ~ SN(^,u; 2 ,a), dropping the subscript k for simplicity. 
If a random sample y = {yi, ■ ■ ■ , y n ) T is available, the loglikelihood function for the direct 
parameters DP = (£, to, a) is 



where z = io~ l {y — £l n ) and Z{ denotes its i-th component; here l n is the n x 1 vector of all 
ones. We shall denote by a the maximum likelihood estimate (MLE) of a, and similarly for 
the other parameters. The likelihood equations are immediately written down, namely 



where pu = d(azi). There are however two sort of problems with this parametrization. 
Firstly, there is always an inflection point at a = of the profile loglikelihood. Correspond- 
ingly, at a = 0, the expected Fisher information becomes singular. This phenomenon is a 
special case of the problem studied in greater generality by Rotnitzky et al. (1999). 

In addition, the likelihood function itself can be problematic; its shape can be far from 
quadratic even when a is not near 0. This aspect is clearly illustrated by the plots given by 
Arnold et al. (1993) who have analysed a dataset of size 87, later referred to as the Otis data; 
see also Figure El which refers to the same data. 




(15) 
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For evaluation of the MLE, gradient-based methods have been considered, but better re- 
sults were obtained using the EM algorithm, with the introduction of a fictitious unobserved 
variable which is essentially \Xq\ of Proposition [TJ This method works satisfactorily, at least 
when the initial values are chosen by the method of moments. As typical for the EM al- 
gorithm, reliability rather than speed is its best feature. Methods for accelerating the EM 
algorithm are available; see for instance Meng & van Dyk (1997) and references therein. 
However, we prefer to expand in greater detail the discussion of another approach, for the 
reasons explained in the next subsection. 



6.2 Centred parameters 

To avoid the singularity problem of the information matrix at a = 0, Azzalini (1985) has 
reparameterized the problem by writing 

Y = li + ctZ , 

where 

Z° = (Z-v z )/a z , a g = (l-fi) 1/2 , 

and considering the centred parameters CP = ((J,,(T, 71) instead of the DP parameters. Here 
71 is the usual univariate index of skewness, which is equal to the square root of the multi- 
variate index of skewness of Section |4l taken with the same sign of a. Clearly, there is the 
correspondence 

£ = fj, - aa~ l n z , uj = aa z 1 . 

In the case of a regression problem, write E{Yj} = xJ/3, where Xi is a vector of p covariates 
and j3 is vector parameter. The corresponding loglikelihood is then 

l(CP) = n\og(a z /a) - £* T z + 

where 

Zi = Hz + a z a~ X (Vi ~ xJP) = Hz + cr z ri, z=(z 1} ..., z n ) T . 

In case we wanted to reformulate the regression problem in terms of direct parameters, then 
only the first component must be adjusted, namely 

tf p = tf P - oHzlo z 

in a self-explanatory notation. 

The gradient and the Hessian matrix of the loglikelihood in the CP parametrization are 
more involved than with the DP parametrization, and we confine the details in an appendix. 
The effects of the reparametrization are however beneficial in various respects and worth the 
algebraic complications, for the following reasons. 

o The reparametrization removes the singularity of the information matrix at a = 0. This 
fact was examined numerically by Azzalini (1985), and checked by detailed analytic 
computations by Chiogna (1997). 

o Although not orthogonal, the components of CP are less correlated than those of DP, 
especially /i and the 71 . This fact can be checked numerically with the aid of the expres- 
sions given in an appendix. 
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Figure 3: Twice relative profile loglikelihood of 71 (left) and contour level of the similar function of 
(a, 71) (right) for the Otis data, when the centred parametrization is used 

o The likelihood shape is generally much improved. This is illustrated by Figure [3l which 
refers to the same data of Figure [2 the left panel refers to twice the relative profile 
loglikelihood for the new shape parameter 71, and the right panel refers to the pair 
(cr, 71). There is a distinct improvement over the earlier figure, in various respects: 

o the inflection point at a = of the first panel of Figure [2] has been removed, with 
only a mild change of slope at 71 = left; 

o the overall shape of the profile loglikelihood has changed into one appreciably 
closer to a quadratic shape; 

o near the MLE point, the axes of the approximating ellipsis are now more nearly 
alligned to the orthogonal axes than before. 

o Simulation work, whose details are not reported here, showed that the marginal dis- 
tribution of £ can be bimodal when n and \a\ are small or moderate; for instance it 
happens with n = 50 , sampling from SN(0, 1,1). Such an unusual distribution of the 
MLE is in qualitative agreement with the findings of Rotnitzky et al. (1999). Again, 
this unpleasant feature disappeared with the CP parametrization, in the sense that the 
distribution of the new location parameter fi exhibited a perfectly regular behaviour. 

The advantages of CP over DP are not only on the theoretical side but also practical, since 
the more regular shape of the loglikelihood leads to faster convergence of the numerical 
maximization procedures when computing the MLE. 

For numerical computation of the MLE, we have obtained satisfactory results by adopting 
the following scheme: (i) choose initial values by the method of moments; (ii) optionally, 
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Figure 4: Simulated data points (small circles) leading to a = oo, with nonparametric density esti- 
mate (dashed curve) and parametric curve with a = 8.14 (continuous curve) 

improve these estimates by a few EM iterations; (iii) obtain the MLE either by Newton- 
Raphson or by quasi-Newton methods. Only in a few cases, the third stage did not converge; 
full EM iteration was then used, and this always led to convegence. 

The set of S-Plus routines developed for these computations, as well as those related to 
the problems discussed later, will be made freely available on the WorldWideWeb. 

6.3 Anomalies of MLE 

Notwithstanding what is stated near the end of the previous subsection, there are still cases 
where the likelihood shape and the MLE are problematic. We are not referring here to diffi- 
culties with numerical maximization, but to the intrinsic properties of the likelihood function, 
not removable by change of parametrization. 

An illustration is provided by Figure[4t here 50 data points, sampled from SN(0, 1, 5), are 
plotted on the horizontal axis, together with a nonparametric estimate of the density (dashed 
curve) and another (continuous) curve representing a skew-normal density. This parametric 
curve has a = 8.14 but it is not the one of the MLE, however: the MLE has a = oo, which 
corresponds to the half-normal density. 

This divergence of a (or equivalently 71 — > 0.99527, its maximal value) looks rather sur- 
prising, since apparently there is nothing pathological in the data pattern of Figure [4l the 
sample index of skewness is 0.9022, which is inside the feasible region of 71. Similar situa- 
tions occur with a non-negligible frequency when n is small to moderate, but they disappear 
when n increases. 

The source of this sort of anomaly is easy to understand in the one-parameter case with £ 
and uj known; £ = 0, oj = 1, say. If all sample values have the same sign, the final term of (fT5"l) 
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increases with ±a, depending the sign of the data but irrespective of their actual values, as it 
has been remarked by Liseo (1990). For instance, if 25 data are sampled from SN(0, 1, 5), the 
probability that they are all positive is about 0.20. 

When all three DP parameters are being estimated, the explanation of this fact is not so 
clear, but it is conceivable that a similar mechanism is in action. 

In cases of this sort, the behaviour of the MLE appears qualitatively unsatisfactory, and an 
alternative estimation method is called for. Tackling this problem is beyond the scope of the 
present paper, however. As a temporary solution we adopted the following simple strategy: 
when the maximum occurs on the frontier, re-start the maximization procedure and stop it 
when it reaches a loglikelihood value not significantly lower than the maximum. This was the 
criterion used for choosing the parametric curve plotted in Figure^ in this case the difference 
from the maximum of the loglikelihood is 2.39, far below the 95% significant point of a x|/2 
distribution. 

The above proposal leaves some degree of arbitrariness, since it does not say exactly how 
much below the maximum to stay. In practice the choice is not so dramatic, because the 
boundary effect involves only a, and when this is large, a > 20 say, the actual shape of the 
density varies very slowly. Moreover, in the numerical cases which have been examined, the 
loglikelihood function was very flat only along the a-axis, while it was far more curved with 
along the location and scale parameters which were then little affected by the specific choice 
of a, within quite wide limits. 

7 Applications to multivariate analysis 
7.1 Fitting multivariate distributions 

In the case of independent observations (yi, . . . ,y n ) sampled from SNfc(£j, f2,a) for i = 
1, . . . , n, the loglikelihood is 

I = -inlog |0| - ^ntr(n-V) + u~\ Vi - &)} (16) 

i 

where 

v = n- 1 J2(yi-&(y l -ti) T - 

i 

The location parameters have been considered to be different having in mind a regression 
context where £j is related to p explanatory variables Xj via 

£j = Xi /3, (i = l,...,n), 

for some p x k matrix f3 of parameters. 

It would be ideal to reproduce in this setting the centred parametrization introduced in 
the scalar case. This approach poses difficulties, and we follow a different direction to obtain 
the MLE. Once the estimates have been computed, they could be converted componentwise 
to the centred parameters. 

The letters y, X, £ will denote the matrices of size n x k, n x p and n x k containing the 
yiS, the XiS, and the £j's, respectively. Also, a notation of type ( m (z) represents the vector 
obtained by applying the function Cm(-) to each element of the vector z. 
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Regarding rj = ui~ l a as a parameter in replacement of a separates the parameters in (fl6l) 
in the following sense: for fixed f3 and rj, maximization of £ with respect ft is equivalent to 
maximizing the analogous function for normal variates for fixed /?, which has the well known 
solution 

(l((3) = V(fi) = n~ l u T u 
where u = (y — Xf3). Replacing this expression in t gives the profile loglikelihood 

f (/3,r?) = -in log 1^)1 - \nk + lJCoM 

with substantial reduction of dimensionality of the maximization problem. Numerical maxi- 
mization of I* is required; this process can be speeded up substantially if the partial deriva- 
tives 

F)P* 

— = X T uV(/3)- 1 -X T Cx(ur 1 ) V T , 
dt TV, N 

— = U (l(U7]), 

are supplied to a quasi-Newton algorithm. Upon convergence, numerical differentiation of 
the gradient leads to approximate standard errors for (3 and rj, hence for a after multiplication 

by ijj. 

The above computational scheme has been used satisfactorily in numerical work with 
non-trivial dimensions of the arrays X, y, /3. Avery simple illustration is provided by Figure [5] 
which refers to a subset of the AIS (Australian Institute of Sport) data examined by Cook & 
Weisberg (1994), which contains various biomedical measurements on a group of Australian 
athletes; we then have k = 4, p = 1, n = 202. Figure [5] displays the scatter plot of each pair 
of the four variables considered superimposed with the contour lines of the marginal density 
obtained by marginalization of the fitted SN4 density. 

Visual inspection of Figure [5] indicates a satisfactory fit of the density to the data. How- 
ever, to obtain a somewhat more comprehensive graphical display, consider the Mahalanobis 
distances 

di = {Vi - Q T n~\yi - 0, (i = l,-..,n), (17) 
which are sampled from a xj. if trie fitted model is appropriate, by using Proposition In 
practice, estimates must be replaced to the exact parameter values in (fT71) . The above di's 
must be sorted and plotted versus the xt percentage points. Equivalently, the cumulative x\ 
probabilities can be plotted against their nominal values l/n, 2/n, . . . , 1; the points should 
lie on the bisection line of the quadrant. This diagnostic method is a natural analogue of a 
well-know diagnostics used in normal theory context (Healy, 1968). 

Figure [6] diplays the second variant of this plot for the AIS data, in its right-hand side 
panel; the left-hand side panel shows the similar traditional plot under assumption of nor- 
mality. Comparison of the two plots indicates a substantial improvement of the skew-normal 
fit over the normal one. 

A similar conclusion is achieved by considering a parametric test for normality which is 
provided by the likelihood ratio test for the null hypothesis a = 0, that is 

2{*(|,n,&)-£(£ ) XJ,0)} 

where (/}, E) denote the MLE of (£, O) under the assumption of normality. The observed 
value of the test statistics in the above example is over 103, and the associated value of the 
xl distribution function does not even need to be computed. 
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Figure 5 : Scatterplots of some pairs of the AIS variables and contour levels of the fitted distribution 
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Figure 6: Healy's plot when either a normal distribution (left panel) or a skew-normal distribution 
(right panel) is fitted to the AIS data 

7.2 Discriminant analysis 

The results of Section [3j once reinterpreted in the more general setting introduced in Sec- 
tion [5l provide tools to examine the behaviour of many classical multivariate techniques, 
when based on linear transforms of the data, in the more general context of SN variables. 
For the present discussion, however, we shall restrict ourselves to a rather simple problem 
of discrimination between two populations, under the traditional hypothesis that they differ 
only in the location parameters. 

If Yi ~ SNfc(£j, Q, a) denote the random variables associated to the two populations (i = 
1,2), then the likelihood-based discrimination rule allocates a new unit with observed vector 
y to population 1 if 

(6 - ^^(y - l(& + £2)) + CoOl) - CoM + logK/vrs) > (18) 

where Wi = Wi(y) = a T uj^ 1 (y — £j) and iti is the prior probability of the i-th population 
(i = 1,2). 

Nonlinearity of the left-hand side of the above inequality prevents explicit solution. How- 
ever, some properties can be obtained; one is that the likelihood-based discriminant function 
is a linear function of y when either of the following conditions holds: 

(6 - i2) T u- 1 a = 0, (19) 
u- 1 a = ctt-\t l -t 2 ) (20) 

where c ia non-zero scalar constant. The proof is omitted. 
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The natural alternative to (fT8l) is the Fisher linear discriminant functions, whose com- 
monly used expression is 

(Ml - M2) T S" 1 (y - + /x 2 )) + log(7ri/7r 2 ) > 0, 

using a self-explanatory notation; in the present case, this can be re-written as 

(6 - £ 2 ) T (^ - ^/ij^)- 1 (y - \{i x + 6 + 2w^)) + log(7ri/7r 2 ) > 0. (21) 

Proposition 11 Mien condition (|T9P hoZds, the discriminant rules 08D and (HU) coincide. 

Proof. First, notice that (fT9l) implies that = w 2 (y) in (fl8l) . Next, use © to invert 

(f2 — ujfj, z fijuj) in (|2TT) . leading to 

(6 - 6) T ^(y - |(6 + 6) - > 0. 

Then, on using ( fT9l ) again and noticing that the vectors r^ 1 ^/^ and have the same 

direction, one obtains the result. 

In the general case, (fT8l) and (|2TT) can only be compared numerically. The various cases 
considered differ for the relative positions of the locations parameters, while the other pa- 
rameters have been kept fixed; specifically, we have set k = 2, tt\ = 7r 2 , oj = h, ^ equal to the 
correlation matrix with off-diagonal elements equal to 0.4, a = (3, 3) T , and ||£i — £ 2 || 2 = 1. 
This choice of the parameters, such that a is an eigenvector of ft, has been made for the sake 
of simplicity, in the following sense. It turns out that the quantities regulating the basic be- 
haviour of the classification rules are the angle 6\ between the vectors uj~ l a and £1 — £ 2 , and 
the angle 6* 2 between uj~ 1 a and il _1 (£i — £ 2 ). The above choice of a and makes it easier 
to choose values of £1 — £ 2 fulfilling conditions ( fT9l ) and ([201) . i.e. such that cos#i = and 

COS 02 = 1. 

Figure [7] shows the relevant entities for a few cases. Each panel of the figure displays the 
contour levels of the two population densities with superimposed the separation lines of the 
two discriminant rules. The bottom-right panel corresponds to a case satisfying ( fT9l ) and only 
one discrimination line is then visible; the top-right panel corresponds to fulfilling (l20l) and 
the two discriminant lines are parallel. 

Table [T] contains summary values of the numerical work, in particular misclassification 
probabilities, for a larger number of cases. For the Fisher rule, classification probabilities can 
be computed exactly with the aid of (TTTl) : for (Tl8l) . the corresponding probabilities have been 
evaluated by simulation methods, using 100000 replicates for each case. The main qualitative 
conclusions from these figures are as follows: (a) the total misclassification probability is 
lower for the likelihood-based rule than for the Fisher linear discriminant, as expected from 
known results (Rao, 1947); (b) the Fisher rule is however not much worse than the other 
one, and its two components are more balanced that the analogous ones of the likelihood- 
based rule, which could be considered as an advantage on its own; (c) for some values of 6\ 
and 62, the fraction of cases which are classified differently by the two rules is not negligible; 
hence the choice of the method can be relevant even if the probabilities of misclassification 
are similar. 

For numerical illustration, we have applied the two discriminant rules to the same subset 
of the AIS data used in subsection \7.1\ The individuals were divided by sex, obtaining two 
groups of 102 male and 100 female athletes, respectively, and prior probabilities were set 
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Figure 7: Contour plots of four pairs of SN 2 variables, with likelihood discriminant function (contin- 
uous line) and Fisher linear discriminant function (dashed line) 
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Table 1: Classification probabilities of likelihood-based and Fisher linear discriminant rules. The 
entries are: p 1L , misclassification error probability using likelihood based rule, when sampling from 
population 1; p\ F , misclassification error probability using Fisher linear discriminant function, when 
sampling from population 1; p 2 L and p 2F are similar quantities in the case of sampling from population 
2; p* , probability that the discriminant rules coincide; B x and 9 2 are angles associated to the relative 
position of the location parameters, as described in the text 
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Allocated groups 
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G 2 
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G4 


Gi 
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0, 


G 2 


2,2 


36, 37 
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0, 


G 3 


0, 


0, 


22, 20 


10, 11 


G 4 


0, 


3,2 


14, 14 


67, 66 


Total 


57 


44 


40 


77 



Table 2: Discrimination of the four groups of the hepatic data; the data indicate the number of 
individuals classified by likelihood rule (first entry) and by the Fisher discriminat function (second 
entry) 

equal to the observed frequencies. In this case 6\ = 1.54041 radians, a situation not so far 
from the one associated with ([T9l) , i.e coincidence of the two discriminant functions. In fact 
the total number of misclassified subjects differs only for one unit: more precisely Fisher 
rule fails in three units, while the likelihood-based one fails in two. Further numerical work 
has been done using data reported by Albert & Harris (1987, chapter 5), fairly often used 
for illustration in the context of discriminant methods. An overall sample of 218 individuals 
affected by liver problems are divided into four groups, corresponding to severity of their 
status: acute viral hepatitis (group G\, 57 patients), persistent chronic hepatitis (G2, 44 
patients), aggressive chronic hepatitis (G3, 40 patients), and post-necrotic cirrhosis (G4, 77 
patients). Albert & Harris (1987) construct a discrimination rule based on data on three 
of four available liver enzymes: aspartate aminotransferase (AST), alanine aminotransferase 
(ALT) and glutamate dehydrogenase (GLDH); the data have been logarithmically transformed 
because of extreme skewness in the original variables. To ease comparison, we employed the 
same variables and applied the same data transformation. 

Goodness-of-fit and graphical diagnostics, along the lines of subsection 17. 1 L confirm the 
adequacy of the skew-normal distribution in modeling this set of variables. Prior probabilities 
were set equal to the observed frequencies, i.e. iti = 0.26, it2 = 0.20, ir^ = 0.18 and ir^ = 
0.35. The summary results, shown in Table [H indicate a slight improvement using the SN 
distribution instead of the normal one, in the sense that 3 data points which were incorrectly 
classified by the Fisher rule are now correctly classified, and only one is moved in the reverse 
direction. 

7.3 Regression and graphical models 

Graphical models are currently a much studied research topic. This subsection examines some 
related issues when the assumption of normal distribution of the variable is replaced by (Q]) . 
We adopt Cox & Wermuth (1996) as a reference text for background material. 

In the construction of a graphical model of normal variables, a key ingredient is the con- 
centration matrix, i.e. the inverse of the covariance matrix, possibly scaled to obtain unit 
diagonal elements. When the (i, j)-th entry of the concentration matrix is 0, this indicates 
that the two corresponding components, Y; and Yj say, are independent conditionally on all 
the others. The associated concentration graph has then no edge between Yi and Yj. 

The results of sections [3] and [5] enable us to transfer the above scheme in the context 
of skew-normality; consider in particular Proposition [6] and expression ( fT4l ) . Hence, two 
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components, Yi and Yj say, of Y ~ SNfc(£, f2, a) are independent conditionally on the others 
if the (i, j)-th entry of f2 _1 is zero and at most one of and a-, is different from zero. Hence 
O -1 plays a role analogous to the concentration matrix in normal theory context, but also a 
must be considered now. 

Building a graphical model from real data involves to follow essentially the strategy pre- 
sented by Cox & Wermuth (1996) for the normal case. The main difference is in the distinc- 
tion between regression and conditioning, which are essentially coincident in the normal case 
but not here. 

Since it seems best to illustrate the actual construction of a graphical model in a specific 
example, we consider the data analysed by Cox & Wermuth (1996, chapter 6), concerning 68 
patients with fewer than 25 years of diabetes. This dataset is of rather small sample size for an 
adequate fitting of a multivariate SN distribution, but it has been adopted here because it is a 
'standard' one in this context. For each patient, eight variables are recorded; of these, glucose 
control (Y) and knowledge about illness (X) are the primary response and the intermediate 
response variables, respectively; the special role of these two variables drives the subsequent 
analysis. Of the other variables, W, A and B are explanatory variables regarded as given, 
with A and B binary; Z, U and V are other stochastic variables. See the above reference for 
a full description of the variables and some background information. 

A preliminary analysis, using the methods described at the end of subsection 17.11 shows 
the presence of a significant skewness in the distribution of some of the variables; this is 
largely due to the X component but not only to this one. Therefore, we introduce a multi- 
variate regression model of type 

(Y,X, Z,U,V)~S£i 5 {Z,n,a) 

where £ is a linear combination of (1, W, A, B), and and a are constant across individuals. 
Fit of the above model, using the algorithm described in section I7TH led to a boundary solu- 
tion, in the sense that the components of a diverged. Adopting the simple method described 
in section l6T3l to handle these cases, a set of parameters has been chosen inside the parameter 
space having a loglikelihood value about 7.7 units lower than the maximum, which is a very 
minor loss in consideration of the large number of parameters being estimated. 

Table [3] gives the partial correlation matrix, &*, which is Q -1 after scaling to obtain unit 
diagonal entries and changing signs of the off-diagonal entries, and the shape parameters 
with standard errors and i-ratios. 

Because of the different role played by the variables in the present problem, the most 
relevant entries of Table [3] are those of the first two rows of fi* . Joint inspection of both 
components of Table [3] indicates conditional independence of (Y, Z), (Y, U) and (Y, V), while 
there is conditional dependence between (X,Z) and between (Y,X). Moreover the results 
concerning the regression component suggest dropping B from the model. 

Additional numerical work not reported here has been carried out to examine the sensi- 
tivity of the results to the choice of the point where the MLE iteration sequence was stopped. 
The overall conclusions are as follows: the regression coefficients and their observed signif- 
icances are stable over a wide range of stopping points; the individual components of a are 
not so stable, but the overall significance of the test for normality described at the end of 
Section [7TTI remains well below 1%. The instability of the components of a is not surprising 
considering that the sample size, n = 68, is small in this context, as discussed in Section [631 

Reduction of the model, dropping components because of the non-significant coefficients 
or because of their irrelevance to the variables of interest, leads to consideration of the triplet 



25 







Y 


X 


Z 


U 


V 


Y 


/ 


1 


-0.49 


0.09 


-0.16 


0.06 \ 


X 




-0.49 


1 


-0.38 


-0.04 


0.17 


Ci* = z 




0.09 


-0.38 


1 


0.42 


-0.25 


u 




-0.16 


-0.04 


0.42 


1 


-0.07 


V 


V 


-0.06 


0.17 


-0.25 


-0.07 


1.00 / 






Y 


X 


Z 


U 


V 


a 




1.53 


-32.89 


-3.49 


-1.16 


-2.41 


std. error 


6.4 


11.68 


2.89 


7.27 


2.70 


t-ratio 




0.24 


-2.81 


-1.21 


-0.16 


-0.89 



Table 3: Matrix Cl*, a and other quantities associated to the regression analysis of {Y, X, Z, U, V) on 
(1, W, A, B) for the glucose data 
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Table 4: Matrix Q*, a and other quantities associated to the regression analysis of (Y, X, Z) on 
(1, W, A) for the glucose data 

(Y, X, Z) with explanatory variables (1,W,A). The new matrix Cl* and the vector a are as 
reported in Table |4) 

The final graphical model has an edge between (X, Y) and between (X, Z) to represent 
conditional dependence, for fixed values of (A, W) as indicated by Cl* and a in Table [4} 
background information can be used to choose a direction on these arcs. Additional arcs are 
added from the fixed variables to the stochastic ones with the aid of the estimates and related 
t-ratios obtained from the last regression analysis, namely directed arcs between (A, Y), and 
(W,Z). 

The pictorial representation of the graphical model is similar to the regression graph of 
Figure 6.4 of Cox & Wermuth (1996, p. 141), except for the arcs for they added on the basis 
of univariate regressions. Clearly, the building procedures and the associated interpretations 
are a bit different, and the two types of arcs (arising from conditional dependence and from 
regression) should be kept graphically distinct in our case. 

We stress again that the above discussion intended to illustrate the use of the conditional 
independence techniques with the aid of a well-known dataset, not to produce a full data 
analysis. Moreover, the estimation method presented in Section l7TT1 must be used with caution 
with small samples like this one. 
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8 An extension to elliptical densities 



The univariate skew-normal distribution was obtained by applying a skewing factor to the 
standard normal density, but the same method is applicable to any symmetric density, as 
stated in Lemma 1 of Azzalini (1985). This lemma can be extended to the fc-dimensional 
case where the notion of symmetric density is replaced by the notion of elliptical density. The 
following lemma is a direct generalization of Lemma 1 of Azzalini (1985), of which it also 
follows the same line of argument in the proof. 

Lemma 12 Denote by X a continuous random variable with density function G' symmetric 
about and by Y = (Y\, . . . , Y/,) T a continuous random variable with density function f, such 
that X and Y are independent. Suppose that the real-valued transform W(Y) has symmetric 
density about 0. Then 

f(y) = 2f(y)G(W(y)) (22) 

is a k-dimensional density function. 

Proof. Since X — W(Y) is symmetric about 0, then 

\ = ¥{X < W(Y)} = E Y {F{X < W(Y)\Y}} = [ G(W(y)) f(y) dy. 

Corollary 13 Suppose that X and Y satisfy the conditions of the above lemma, and in addition 
that Y has elliptical density centred at the origin; if 

W{Y) = aiYi + • • • + a k Y k = a T Y (23) 

then d2ZP is a k-dimensional density function for any choice of a. 

Proof. The statement follows by noticing that a T Y has 1 -dimensional elliptical distribution, 
i.e. its density is symmetric about 0. See Theorem 2.16 of Fang, Kotz & Ng (1990) for the 
distribution of a linear transform of elliptical variables. 

Clearly, (1221) with W{y) of type (1231) includes the SN fc density for suitable choice of /, G 
and any choice of a. 

In principle, Lemma [12] can be applied also to non-elliptical densities. For instance, if 
Y ~ SNfc and a is chosen suitably, according to Proposition [3l the density of W can be made 
normal, hence symmetric. There is however a major difference: in this case, the property 
holds for specific choices of a depending on the given choice of /, while with the elliptical 
densities it holds for all a's. 

Implicit in the proof of the lemma there is an acceptance-rejection idea, hence a condi- 
tioning argument, similar to the one of Azzalini (1986), leading to the following method for 
random number generation. If X and Y are as in above lemma, and 

[Y \fX<W(Y), 
\-Y i(X>W(Y), 

then the density function of Z is (1221) . In fact, its density at point z is 

f(z)G(W(z)) + f(-z){l - G(W(-z))} 

which is equal to 2 f(z) G(W(z)) if f(z) = f(—z), a condition fulfilled e.g. by elliptical densi- 
ties centred at 0. 
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9 Further work 



Various issues related to the SN family have been discussed, but many others remain pending. 
Broadly speaking, these fall in two categories: open questions and further applications. 

Among the open questions, the anomalous behaviour of MLE in cases described in sec- 
tion 16.31 is worth exploration even per se. In the multivariate case, construction of more 
accurate standard errors would be welcome. A more radical solution would be the introduc- 
tion of the centred parametrization which has not been carried on from the univariate to the 
multivariate case. 

Besides applications to numerically more substantial applied problems than those dis- 
cussed here, it is worth exploring the relevance of the distribution in other areas of multivari- 
ate statistics, in addition to those touched in section A natural aspect to consider is the 
behaviour of other linear statistical methods outside normality, not only discriminant anal- 
ysis. Another relevant use could be in connection with sampling affected by bias selection; 
this has been discusses by Copas & Li (1997) and references quoted therein, in the case of a 
scalar response variable. The skew-normal distribution offers the framework for a multivari- 
ate treatment of the same problem, by consideration of its genesis via conditioning. 

The generalization to skew-elliptical densities has been left completely unexplored. An 
adequate treatment of the connected distributional and statistical issues requires the space of 
an entire paper. Hence, this direction has not been explored here, but a brief mention seemed 
to be appropriated, partly because of its close connection with the SN distribution. 
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Appendices 

Two equivalent parametrizations 

We want to show that the (O, a) parametrization adopted in this paper is equivalent to the 
(A, V) parametrization of Azzalini & Dalla Valle (1996). 

The matrix ft and the vector a appearing in ([]]) were defined in Azzalini & Dalla Valle 
(1996) in terms of a correlation matrix \& and a vector A = (Ai, . . . , A^) T ; specifically, they 
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defined 

A = diag((l + A?)- 1/2 ,...,(l + A^)- 1/2 ), (24) 

n = A(^ + AA T )A, (25) 

a = (i + A t * _1 a) A -1 *^. (26) 

. -1/2 



Also, they defined 8 = (8\, . . . , 8^) where 8j = Xj + AjJ for j = 1, . . . , k. 

With some algebraic work, it can be shown that (1251) and (1261) are invertible, obtaining 

* = A _1 (n-« T )^ -1 C27) 

and ([3]), which then gives A using Xj = 8j ( 1 — 5| J . As a by-product, ([5]) is obtained. 



Clearly, for any choice of the (A, pair, we obtain a feasible (f2, a) pair; hence, we must 
only show the following. 

Proposition 14 For any choice of the correlation matrix and of the vector a £ M fc , (QJ is a 
density of SN^ type. 

Proo/ Given a and $7, compute J using ([3]) . This vector must satisfy condition Q — 88 T > 0, 
required by ([271) ; hence we must check that 

n - (1 + a T J2a) -1 f2oa T n > 0. 

By using ©, the left-hand side can be seen to be equal to + aa T ) _1 which is positive 
definite. Moreover, fulfillment of ft — 88 T > implies that all components of 8 are less than 
1 in absolute value. Algebraic equivalence of J24l) - (l26l) and ©, J27l) completes the proof. 

Gradient and hessian of the centred parameters 

The partial derivatives of £(CP) defined in Section [6721 with respect to a, X) are 

dl 

— = (a z /a) 2 X T {y-XP-aa; 1 (X Pl - f x z l n )}, 
& = - n / a + az (y- X f3) T (z- Pl X)/o- 2 , 

- = -o z -z z + Pl (z + Xz) 

where z' denotes the derivative with respect to A, and 

Pi = Ci(Az), z' = fi' z + a~ x {y - X/3)a' z =n' z + ra' z , 
, _ (2/tt) 1/2 , _M* / 

^- ( l + A 2)3/2' °*- a /f 

To obtain the partial derivatives with respect to 71, use 

dl _ dl d 7i d 7i _ 3(4 -tt) n 2 z (n' z a z - fi z a' z ) 

a.. fl\/ ji ' j \ o _4 



9 7 i dA' dA dA 2 a 
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z 



di di dA dA 2/1 1 - 2/vr 

+ 



or equivalently 



3 7 i dX d7i' d 7l 3(4 -vr) V^i? 2 T 3 
where 

U = = (j^) ^ T = (2/vr - (1 - 2/vr)i? 2 ) 1/2 . 

The above derivatives lead immediately to the likelihood equations for CP = (/3, a, 71). 
We need second derivatives for numerical efficient computations, and for computing the ob- 
served information matrix. The entries of the Hessian matrix for (/?, a, A) are given by 

= (a z /a) 2 X T (I n -X 2 P 2 )X, 



T 



d/3d/3 

(a z /a 2 )X T (z-X Pl + (I n -X 2 P 2 )(z- f i z l n )), 



d(3da 
d 2 i 

~~d~(3dX = ^^{v'zi- 2 ™* + X Pi ~ Mn) + <r*(pi + W - n' g l n )}, 

.0 = a ~2 { _ n + 2(Jzr T {z _ Xpi) + CT 2 r T (/n _ A 2p 2)r}) 

d 2 £ 

= -^V (<^(z - Api) + <r z (z' - pi - AP 2 z)) , 

= w (^) 2 - ^< + (2 /)T z / + Z T Z „ _ ~Tp 2 ~ _ p T {2z , + A/ / } 



where 



r = a' 1 (y-Xp), z = z + Xz' , 

Pi = Ci(Az), P 2 = diag(p 2 ) = diag(C 2 (Az)), 

z" = ^ = ^ + a':a-\y-X(3), 

.// _ d/4 _ 3// 2 _„ _ da' z _ ( n' z {n' z a z - n z a' z ) /i z ^ 

Cft Oy. 



M * ~ ~dA~ ~ "(1 + A 2 ) 2 ' *~~dT~ * ~ 2 



Again, to obtain the Hessian matrix with respect to 71 instead of A, the last row and last 
column of the above matrix must be multiplied by dA/ d 7 i, except the bottom right element 
which is computed as 

d 2 l _ _&H_ t dA\ 2 _ dl_ /d 2 A\ 

~9 7l 2 ~ ~dx 2 V d 7 J " dx V d 7l 2 y ' 

The final term of this expression is given by 

d 2 A _ 2 / V IB! 3(1 - 2/tt)T' 
d^ ~ ~3(4-tt) [jTRf + ^3 + ^4 

where 

. dii 2 , dT R R' 

* = d^ = 3iP(4-,r)' T = ^ = ~ {l ~ 2M ~r- 
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For practical numerical work, the above quantities suffices. If the expected Fisher infor- 
mation matrix Iqp is needed, this is given by 



I C p = D T I DP D 



where Ip>p is the information matrix for the DP parameters, given by Azzalini (1985) in the 
case X = l n , and 

dco 



where 



D 



( d(DP)j 
\d(CP) 3 



3cr z 7i ' 



1 1 


^£ 








1 









<Jz 


1° 






dco 



dX 

aa' z dA 
cj2 d7i' 
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